30 research outputs found

    The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages

    We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional documents being available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of nearly 9 million words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. Most texts have been manually classified according to the EUROVOC subject domains so that the collection can also be used to train and test multi-label classification algorithms and keyword-assignment software. The corpus is encoded in XML, according to the Text Encoding Initiative Guidelines. Due to the large number of parallel texts in many languages, the JRC-Acquis is particularly suitable for carrying out all types of cross-language research, as well as for testing and benchmarking text analysis software across different languages (for instance for alignment, sentence splitting and term extraction). Comment: A multilingual textual resource with meta-data freely available for download at http://langtech.jrc.it/JRC-Acquis.htm
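The figure of 190 language pairs follows from the 20 official languages: the number of unordered pairs is 20 × 19 / 2 = 190, and each additional candidate-country language adds further pairs (hence "190+"). A minimal sketch of that arithmetic, using placeholder language labels because the abstract does not list the actual language codes:

```python
from itertools import combinations
from math import comb

# Placeholder labels (assumption): the abstract does not enumerate the languages.
languages = [f"lang{i:02d}" for i in range(1, 21)]

# Every unordered pair of distinct languages is one alignment direction-free pair.
pairs = list(combinations(languages, 2))
pair_count = len(pairs)

# Cross-check against the closed form n * (n - 1) / 2.
assert pair_count == comb(len(languages), 2)
print(pair_count)  # 190
```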

    Multilingual person name recognition and transliteration

    We present a tool for spotting person names in articles from the international press that can recognise the different variants of the same name. The originality of our approach lies in identifying name variants across languages and writing systems, including Greek, Cyrillic and Arabic. Given our multilingual setting, we use a standard internal representation of each name and a single similarity measure (instead of adopting the usual bilingual approach to transliteration). This module is part of a more general tool that analyses an average of 15,000 newspaper articles every day in order to group similar documents, both within the same language and across different languages.

    We present an exploratory tool that extracts person names from multilingual news collections, matches name variants referring to the same person, and infers relationships between people based on the co-occurrence of their names in related news. A novel feature is the matching of name variants across languages and writing systems, including names written with the Greek, Cyrillic and Arabic writing systems. Due to our highly multilingual setting, we use an internal standard representation for name representation and matching, instead of adopting the traditional bilingual approach to transliteration. This work is part of a news analysis system that clusters an average of 25,000 news articles per day to detect related news within the same and across different languages.
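The idea of mapping every name to one internal representation and then applying a single similarity measure can be illustrated with a small sketch. This is not the authors' method: the transliteration table, the use of `difflib.SequenceMatcher` as the similarity measure, the function names and the 0.8 threshold are all assumptions chosen for illustration.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny Cyrillic-to-Latin table (illustration only).
CYRILLIC = "\u0430\u0432\u0434\u0438\u043b\u043c\u043d\u043f\u0440\u0442\u0443"  # a v d i l m n p r t u
LATIN = "avdilmnprtu"
TRANSLITERATE = str.maketrans(CYRILLIC, LATIN)


def normalise(name: str) -> str:
    """Map a name to a lowercase, diacritic-free, Latin-script form."""
    transliterated = name.lower().translate(TRANSLITERATE)
    decomposed = unicodedata.normalize("NFKD", transliterated)
    # Drop combining marks so that accented and plain letters compare equal.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def similarity(a: str, b: str) -> float:
    """One similarity measure applied to the normalised forms (assumed choice)."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def same_person(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two spellings as variants when they are similar enough."""
    return similarity(a, b) >= threshold
```

Because every script is reduced to the same representation before comparison, one measure covers all language pairs, so no separate transliteration rules are needed for each pair of languages.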

    Progress in achieving quantitative classification of psychopathology

    Shortcomings of approaches to classifying psychopathology based on expert consensus have given rise to contemporary efforts to classify psychopathology quantitatively. In this paper, we review progress in achieving a quantitative and empirical classification of psychopathology. A substantial empirical literature indicates that psychopathology is generally more dimensional than categorical. When the discreteness versus continuity of psychopathology is treated as a research question, as opposed to being decided as a matter of tradition, the evidence clearly supports the hypothesis of continuity. In addition, a related body of literature shows how psychopathology dimensions can be arranged in a hierarchy, ranging from very broad "spectrum level" dimensions, to specific and narrow clusters of symptoms. In this way, a quantitative approach solves the "problem of comorbidity" by explicitly modeling patterns of co-occurrence among signs and symptoms within a detailed and variegated hierarchy of dimensional concepts with direct clinical utility. Indeed, extensive evidence pertaining to the dimensional and hierarchical structure of psychopathology has led to the formation of the Hierarchical Taxonomy of Psychopathology (HiTOP) Consortium. This is a group of 70 investigators working together to study empirical classification of psychopathology. In this paper, we describe the aims and current foci of the HiTOP Consortium. These aims pertain to continued research on the empirical organization of psychopathology; the connection between personality and psychopathology; the utility of empirically based psychopathology constructs in both research and the clinic; and the development of novel and comprehensive models and corresponding assessment instruments for psychopathology constructs derived from an empirical approach. © 2020 Published by Elsevier Masson SAS.

    Multilingual person name recognition and transliteration

    We present an exploratory tool that extracts person names from multilingual news collections, matches name variants referring to the same person, and infers relationships between people based on the co-occurrence of their names in related news. A novel feature is the matching of name variants across languages and writing systems, including names written with the Greek, Cyrillic and Arabic writing systems. Due to our highly multilingual setting, we use an internal standard representation for name representation and matching, instead of adopting the traditional bilingual approach to transliteration. This work is part of a news analysis system that clusters an average of 25,000 news articles per day to detect related news within the same and across different languages.

    The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages

    We present a new, unique and freely available parallel corpus available in all 20 official European Union (EU) languages, with additional documents available for some EU candidate countries. The average size is about 10 million (check?) words per language. Pair-wise paragraph alignment information produced by two different aligners (Vanilla and HunAlign) is available for all 190+ language pair combinations. The UTF-8-encoded collection in XML format is accompanied by a tool to produce a bilingual paragraph-aligned parallel corpus for 190+ possible language pair combinations. Most texts have been manually classified according to the EUROVOC subject domains so that the collection can also be used to train and test multi-label classification algorithms and keyword-indexing software. Due to the considerable number of parallel texts in many languages, the JRC-Acquis is particularly suitable for testing and benchmarking text analysis software (for instance for alignment, sentence splitting and term extraction) across different languages. JRC.G.2-Support to external securit

    Multilingual Person Name Recognition and Transliteration

    We present an exploratory tool that extracts person names from multilingual news collections, matches name variants referring to the same person, and infers relationships between people based on the co-occurrence of their names in related news. A novel feature is the matching of name variants across languages and writing systems, including names written with the Greek, Cyrillic and Arabic writing systems. Due to our highly multilingual setting, we use an internal standard representation for name representation and matching, instead of adopting the traditional bilingual approach to transliteration. This work is part of a news analysis system that clusters an average of 25,000 news articles per day to detect related news within the same and across different languages. JRC.G.2-Support to external securit

    Geocoding Multilingual Texts: Recognition, Disambiguation and Visualisation

    We present a method to recognise geographical references in free text. Our tool must work on various languages with a minimum of language-dependent resources, except a gazetteer. The main difficulty is to disambiguate these place names by distinguishing places from persons and by selecting the most likely place out of a list of homographic place names world-wide. The system uses a number of language-independent clues and heuristics to disambiguate place name homographs. The final aim is to index texts with the countries and cities they mention and to automatically visualise this information on a geographical map. JRC.G.2-Support to external securit
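Selecting the most likely place among homographs can be sketched as a ranking over gazetteer candidates. The abstract does not name the clues it uses, so the two heuristics below (prefer a candidate whose country is already mentioned in the text, then prefer the higher importance weight) are assumptions, as are the `Place` type, the toy gazetteer and its weights.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Place:
    name: str
    country: str
    weight: int  # hypothetical importance score, not real data


# Toy gazetteer for illustration only; the weights are invented.
GAZETTEER = [
    Place("Paris", "France", 3),
    Place("Paris", "United States", 1),
    Place("Dallas", "United States", 2),
]


def candidates(name: str) -> list:
    """All gazetteer entries sharing the given surface form."""
    return [place for place in GAZETTEER if place.name == name]


def disambiguate(name: str, context_countries: Set[str]) -> Optional[Place]:
    """Pick one homograph: country context first, then importance weight."""
    options = candidates(name)
    if not options:
        return None
    return max(options, key=lambda p: (p.country in context_countries, p.weight))
```

Both heuristics rely only on the gazetteer and on what else the text mentions, which is in keeping with the stated goal of keeping language-dependent resources to a minimum.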

    Geocoding multilingual texts: Recognition, disambiguation and visualisation

    We present a method to recognise geographical references in free text. Our tool must work on various languages with a minimum of language-dependent resources, except a gazetteer. The main difficulty is to disambiguate these place names by distinguishing places from persons and by selecting the most likely place out of a list of homographic place names world-wide. The system uses a number of language-independent clues and heuristics to disambiguate place name homographs. The final aim is to index texts with the countries and cities they mention and to automatically visualise this information on geographical maps using various tools.